
Pyspark with error self.

2023-04-02 04:41 | Source: web compilation | Views: 265

Always avoid UDFs when Spark built-in functions can do the job. You can rewrite your logic using the when function like this:

from pyspark.sql import functions as F

def get_include_col():
    c = F.when((F.col("curr_year") == F.col("start_year")) & (F.col("curr_month") >= F.col("start_month")), F.lit(1)) \
         .when((F.col("curr_year") == F.col("end_year")) & (F.col("curr_month") <= F.col("end_month")), F.lit(1)) \
         .when((F.col("curr_year") > F.col("start_year")) & (F.col("curr_year") < F.col("end_year")), F.lit(1)) \
         .otherwise(F.lit(0))
    return c

temp = temp.withColumn('include', get_include_col())
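To see what the when chain encodes, here is a minimal plain-Python sketch of the same year/month window test; `in_range` is a hypothetical helper name introduced for illustration, not part of the Spark answer:

```python
# Plain-Python equivalent of the when() chain above: a (year, month) pair
# is "included" if it falls inside the [start, end] year/month window.
def in_range(curr_year, curr_month, start_year, start_month, end_year, end_month):
    if curr_year == start_year and curr_month >= start_month:
        return 1
    if curr_year == end_year and curr_month <= end_month:
        return 1
    if start_year < curr_year < end_year:
        return 1
    return 0

print(in_range(2021, 5, 2021, 3, 2022, 1))  # inside the start year -> 1
print(in_range(2023, 6, 2021, 3, 2022, 1))  # past the end year -> 0
```

Checking the logic this way first makes it easier to trust the column expression before running it on a cluster.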

You can also use functools.reduce to dynamically generate the when expressions without having to type all of them out. For example:

import functools
from pyspark.sql import functions as F

cases = [
    ("curr_year = start_year and curr_month >= start_month", 1),
    ("curr_year = end_year and curr_month <= end_month", 1),
    ("curr_year > start_year and curr_year < end_year", 1),
]

include_col = functools.reduce(
    lambda acc, x: acc.when(F.expr(x[0]), F.lit(x[1])),
    cases,
    F
).otherwise(F.lit(0))

temp = temp.withColumn('include', include_col)
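The reduce-based accumulation pattern can be sketched in plain Python without a Spark session; the `classify` function and its cases below are illustrative assumptions, and note that because each step wraps the previous function, conditions here are checked in reverse list order (harmless when cases are mutually exclusive, as they are in the answer above):

```python
import functools

# Each reduce step wraps the accumulator in one more conditional,
# mirroring acc.when(condition, value) in the Spark version.
cases = [(lambda x: x < 0, "negative"), (lambda x: x == 0, "zero")]

classify = functools.reduce(
    lambda acc, case: (lambda x: case[1] if case[0](x) else acc(x)),
    cases,
    lambda x: "positive",  # plays the role of .otherwise(...)
)

print(classify(-3))  # negative
print(classify(0))   # zero
print(classify(7))   # positive
```

This shows why the Spark snippet can seed the reduce with the `F` module itself: the first `acc.when(...)` call resolves to `F.when(...)`, which returns a Column, and every later step chains `.when(...)` on that Column.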

